Clustering Full Text Documents
Abstract
An index or topic hierarchy of full-text documents can organize a domain and speed information retrieval. Traditional indexes, like the Library of Congress system or Dewey Decimal system, are generated by hand, updated infrequently, and applied inconsistently. With machine learning, they can be generated automatically, updated as new documents arrive, and applied consistently. Despite the appeal of automatic indexing, organizing natural language documents is a difficult balance between what we want to do and what we can do. This paper describes an application of clustering to full-text databases, presents a new clustering method, and discusses the data engineering necessary to use clustering for this application. In particular, the paper deals with engineering the feature set to permit learning and otherwise engineering the data to match assumptions underlying the learning algorithm.
Related Resources
Document Clustering Based on Ontology and a Fuzzy Approach
Data mining, also known as knowledge discovery in databases, is the process of discovering unknown knowledge from large amounts of data. Text mining applies data mining techniques to extract knowledge from unstructured text. Text clustering, one of the important techniques of text mining, is the unsupervised classification of similar documents into groups. The most important step...
A Survey on Text Mining in Clustering
Text mining has important applications in the area of data mining and information retrieval. One of the important tasks in text mining is document clustering. Many existing document clustering techniques use the bag-of-words model to represent the content of a document. It is only effective for grouping related documents when these documents share a large proportion of lexically equivalent term...
Faster Full Text Search through Document Clustering Diploma Thesis
Fast and easy access to information has become a keystone in our fast-paced world. Full text search remains an important technique in the area of information retrieval and excels wherever fast access to large amounts of text is a prime concern. Improving full text search is still an active research area. Faster full text search leads to higher throughput, reduced hardware costs and an overall i...
Metaheuristic Clustering of Persian XML Documents Based on Structural and Content Similarity
With the growing number of XML documents, organizing them effectively so that useful information can be retrieved is essential. One possible solution is to cluster XML documents in order to discover knowledge. A key issue in clustering XML documents is how to measure the similarity between them. Conventional clustering of text documents using a do...
Inverted Index based Modified Version of K-Means Algorithm for Text Clustering
This research proposes a new strategy in which documents are encoded into string vectors and a modified version of the k-means algorithm is adapted to string vectors for text clustering. Traditionally, when the k-means algorithm is used for pattern classification, raw data must be encoded into numerical vectors. This encoding may be difficult, depending on a given application area of pattern classifi...